Towards Automating Exploratory Data Analysis

Paul M Washburn

May 6, 2019

auto_explore

Machine learning practitioners first need to identify signal in their datasets before building models.

Iteration cycle time matters in the development of machine learning solutions. This work is a first attempt at accelerating this cycle time for a range of dataset types.

It is assumed the user is inside a Jupyter Notebook REPL environment.

Towards Automated Exploration

The primary goal of auto-explore is to establish a codebase that reduces the effort needed to produce a reasonable first-pass exploratory data analysis for a variety of dataset types.

This Python library is a first attempt at automating the process of exploratory data analysis – at least as far as computation and visualization are concerned.

Critical thinking is not included.

Potential Benefits of Semi-automated EDA

  • Faster time to insights & modeling
  • Shorter exploratory data analysis turnaround
  • Reliable processes that are vetted & improved over time
  • No need to re-configure old code to new situations
  • Supplies a base for more in-depth analysis

Inspirational & Instrumental Works

This work necessarily relied upon many excellent open source Python libraries. Some of the inspirations and instrumental tools have been:

  • pandas-profiling: Ultimately not used in this library because the project was not being actively maintained.
  • featexp: A great tool for visualizing univariate analyses of a target; useful pre-ML. This library was integrated into this one for experimentation on extending the work, and is used in producing univariate plots and feature selections.
  • matplotlib: The bare-bones go-to viz package for Python. This package was relied upon heavily to produce this library.
  • seaborn: Abstracts away from matplotlib by performing statistical analysis alongside plot generation. This package is also relied upon heavily.
  • pandas: The go-to data wrangling tool for Python. The pd.DataFrame object as well as many of the tseries capabilities are leveraged.
  • sklearn: An indispensable machine learning library that every data scientist should be proficient in using. Not all of the functionality desired from this package has been integrated into auto-explore yet, but there are plans to leverage this package more in the future.

Overview

The goal was to simply specify a dataset and a few attributes and get analytic visualizations for free.

Much of this library's functionality generates visualizations that are useful in understanding data. However, it also offers a great deal of analytical functionality, including some lightweight machine learning.

  • Linear Regression Analysis: lmplot
  • Text Analysis: tsne
  • Time Series Analysis: plot_tseries_over_group_with_histograms
  • Correlation Analysis: correlation_heatmap, scatterplotmatrix
  • Categorical Analysis: target_distribution_over_binary_groups
  • Clustering Analysis: cluster_and_plot_pca1, cluster_and_plot_pca2, elbow

Helper Functions

Functionality & Use

Using the AutopilotExploratoryAnalysis object will make many methods available on your data with minimal setup.

However, each module in this library can be used separately without ever instantiating this object.

Streamlining functionality into this object is still in alpha stages and is by no means perfect. True automation is a ways away.

auto_explore.eda

This part of the library was designed to be an interface to automatic exploration.

Simply specify a DataFrame and a list for each of its binary, categorical, numerical and text columns. If applicable, set target_col as a one-element list containing the target column name as a string.

This object exposes 17 methods that operate on the DataFrame supplied as an argument. Check out the most recent code on GitHub.
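
One way to prepare the column lists is sketched below. The helper guess_column_types and its max_categories threshold are hypothetical and are not part of auto_explore. The heuristic is only an assumption for illustration, so its output should be reviewed by hand. The constructor call for AutopilotExploratoryAnalysis is deliberately left out because its exact arguments are not given here.

```python
import pandas as pd


def guess_column_types(df, max_categories=20):
    """Hypothetical helper: sort columns into binary/categorical/numerical/text lists."""
    spec = {"binary": [], "categorical": [], "numerical": [], "text": []}
    for col in df.columns:
        series = df[col].dropna()
        n_unique = series.nunique()
        if n_unique == 2:
            # Exactly two distinct values, whatever the dtype
            spec["binary"].append(col)
        elif pd.api.types.is_numeric_dtype(series):
            spec["numerical"].append(col)
        elif n_unique <= max_categories:
            # Non-numeric with few distinct values
            spec["categorical"].append(col)
        else:
            # Non-numeric with many distinct values: treat as free text
            spec["text"].append(col)
    return spec


# Toy data, made up for illustration only
df = pd.DataFrame({
    "churned": [0, 1, 0, 1, 0, 0],
    "plan": ["basic", "pro", "basic", "team", "pro", "team"],
    "monthly_spend": [9.9, 49.0, 9.9, 99.0, 45.5, 120.0],
    "notes": ["late payer", "asked for refund", "happy",
              "new signup", "upgraded", "no contact"],
})

spec = guess_column_types(df, max_categories=3)
target_col = ["churned"]  # one-element list, as described above
```

The resulting lists, together with df and target_col, are the inputs the text above asks for.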

Other Modules

The actual functionality of the AutopilotExploratoryAnalysis object has been abstracted away and modularized into various files within the auto_explore package. This allows for re-use of the code even if the automated option is not taken.

  • auto_explore.viz - Contains all the visualization functions useful in EDA
  • auto_explore.featexp - A copy of the featexp code with custom changes; possible pull request in the future
  • auto_explore.apis - Code that fetches data and machine learning models from sources
  • auto_explore.notebooks - Formatting code for inside a Jupyter Notebook REPL environment
  • auto_explore.stats - Currently only houses best_theoretical_distribution
  • auto_explore.datetime - Houses code pertaining to time-series feature generation
  • auto_explore.diligence - Houses code that performs sanity checks of various sorts
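
To give a flavor of the time-series feature generation that auto_explore.datetime is described as housing, here is a minimal pandas sketch. The function add_datetime_features and the column names are hypothetical and are not taken from the library; the actual module may work quite differently.

```python
import pandas as pd


def add_datetime_features(df, date_col):
    """Hypothetical helper: derive simple calendar features from a date column."""
    out = df.copy()  # leave the caller's DataFrame untouched
    stamps = pd.to_datetime(out[date_col])
    out["year"] = stamps.dt.year
    out["month"] = stamps.dt.month
    out["day_of_week"] = stamps.dt.dayofweek  # Monday == 0, Sunday == 6
    out["is_weekend"] = stamps.dt.dayofweek >= 5
    return out


# Toy data, made up for illustration only
sales = pd.DataFrame({"date": ["2019-05-06", "2019-05-11"], "units": [12, 30]})
features = add_datetime_features(sales, "date")
```

Features like these can then be used as grouping keys or model inputs in later exploration.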

Future Work

The goal for this library is to automate as much of the EDA process as possible for as wide a range of dataset types as possible.

The library is not quite there yet.

Until then, efforts will be made to abstract and integrate as many EDA tasks as possible into this code base, and eventually to provide a full_suite_report mechanism.

Questions?

“The part that is stable we shall predict. The part that is unstable we shall control.”

John von Neumann